Business Analytics

Ensemble Methods

Ayush Patel and Jayati Sharma

10 April, 2024

Pre-requisite

You already…

  • Understand linear regression
  • Have learnt about decision trees

About me

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at the Gokhale Institute of Politics and Economics.

I am an RStudio (Posit) certified tidyverse instructor.

I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI) at the University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

Learning Objectives

  • Learn about ensemble methods
    • Bagging
    • Boosting
    • Random Forests

Ensemble Methods - Why?

Content throughout this lecture has been sourced from An Introduction to Statistical Learning. Please check out that work for detailed information.

  • In the previous lecture, you learnt about decision trees
  • Decision trees can be non-robust: a small change in the data can produce a very different tree, and a single tree is often not highly accurate
  • Hence, you can aggregate many decision trees to improve performance
  • The methods that do this are bagging, boosting and random forests

Ensemble Methods - What?

  • An ensemble method combines many simple “building block” models in order to obtain a single, potentially very powerful model
  • In simple words, instead of using a single model, this approach improves the accuracy of the results by combining many models
  • The simple building-block models are also known as weak learners, since on their own they may lead to mediocre predictions

Bagging

  • Decision trees have high variance
  • Why? Because if you split your data differently and fit a decision tree to each split, the resulting trees could be quite different
  • In contrast, a model with low variance will give similar results when applied to distinct datasets
  • Bootstrap aggregation, or bagging, is an approach to reduce this variance

Bagging

  • Bagging involves taking many training sets, building a model using each and then taking an average of the predictions
  • Why? Because averaging a set of observations reduces variance
  • So, essentially bagging builds many such models and averages their results to reduce the variance that comes from a single model
  • The average of the predictions by separate models \(\hat{f}^1(x), \hat{f}^2(x), \ldots, \hat{f}^B(x)\), each built on its own training set, is given by

\[\hat{f}_{avg}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^b(x)\]
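
As a minimal sketch of this averaging step in R, assuming `models` is a list of \(B\) already-fitted models and `newdata` holds the observations to predict on (both placeholder names, not objects defined in this lecture):

```r
# Hypothetical inputs: `models` is a list of B fitted regression models,
# `newdata` holds the observations we want predictions for.
bagged_prediction <- function(models, newdata) {
  # One column of predictions per model: an n x B matrix
  preds <- sapply(models, function(m) predict(m, newdata = newdata))
  # f_avg(x) = (1/B) * sum over b of f^b(x), i.e. the row-wise mean
  rowMeans(preds)
}
```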

Bagging

Hold up…

  • We won’t always have so many datasets
  • So how would we be able to generate so many training sets?
  • Simply put, we take repeated samples, with replacement, from the single training set; this resampling is called the bootstrap
  • ‘B’ regression trees are constructed using ‘B’ bootstrapped training sets, and we take an average of the resulting predictions, as in the sketch below
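
To make this concrete, here is a minimal, illustrative implementation of bagging in R, growing \(B\) trees on bootstrapped samples with the rpart package and the built-in mtcars data (any regression dataset would do):

```r
library(rpart)  # recursive partitioning: fits a single decision tree

set.seed(1)
B <- 100              # number of bootstrapped trees
n <- nrow(mtcars)

# Grow one tree per bootstrap sample (rows drawn with replacement)
trees <- lapply(seq_len(B), function(b) {
  boot_rows <- sample(n, size = n, replace = TRUE)
  rpart(mpg ~ ., data = mtcars[boot_rows, ])
})

# The bagged prediction is the average of the B trees' predictions
preds  <- sapply(trees, predict, newdata = mtcars)
bagged <- rowMeans(preds)
```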

Bagging

  • This way, each individual tree, grown deep and left unpruned, has low bias but high variance
  • Averaging the trees reduces that variance
  • Recall the trade-off between flexibility and interpretability?
  • While bagging improves accuracy, the bagged model is harder to interpret than a single decision tree

Boosting - What?

  • The second method, boosting, is similar to bagging yet takes a slightly different approach
  • Bagging involves creating many models from the training data, with each tree grown independently of the others
  • Boosting involves something similar, except here the trees are grown sequentially
  • Each new tree is grown using information from the previously grown trees, improving the model where it performs poorly

Boosting

  • Here also, there are many decision trees \(\hat{f}^1(x), \hat{f}^2(x), \ldots, \hat{f}^B(x)\)
  • However, instead of growing one large decision tree that overfits, the model learns slowly, step by step
  • A tree is built, and then the next decision tree is fit to the residuals, that is, the errors the current model still makes
  • This new tree is added to the model, which in turn updates the residuals
  • This way, by fitting small trees to the residuals, \(\hat{f}\) is improved upon wherever it does not perform well, as sketched below
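
A minimal sketch of this sequential idea in R, following the boosting-for-regression recipe in ISLR: fit a small tree to the residuals, add a shrunken version of it to the model, update the residuals, and repeat (rpart and mtcars are used purely for illustration):

```r
library(rpart)

set.seed(1)
B <- 200; lambda <- 0.01        # number of trees and shrinkage rate
d <- 1                          # stumps: trees with a single split

X <- mtcars[, -1]               # predictors only
r <- mtcars$mpg                 # residuals start as the raw outcome
boosted <- rep(0, nrow(mtcars)) # the model starts at f(x) = 0

for (b in seq_len(B)) {
  # Fit a small tree to the current residuals, not to the outcome
  fit  <- rpart(r ~ ., data = X, maxdepth = d, cp = 0)
  step <- lambda * predict(fit, newdata = X)
  boosted <- boosted + step     # update the model ...
  r       <- r - step           # ... and shrink the residuals
}
```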

Boosting

Boosting has three tuning parameters

  • The number of trees \(B\): unlike bagging, boosting can overfit if \(B\) is too large
  • The shrinkage parameter \(\lambda\): a small positive number that controls the rate at which boosting learns
    • Can you imagine what an extremely low or high value would imply for the number of trees needed?
  • The number of splits in each tree, \(d\), which controls the complexity of the boosted model
    • Often \(d = 1\) works well (see the sketch after this list)
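
In practice you rarely code the boosting loop by hand; the gbm package (used in the ISLR labs) exposes exactly these three tuning parameters. A sketch with illustrative values:

```r
library(gbm)

set.seed(1)
boost_fit <- gbm(
  mpg ~ ., data = mtcars,
  distribution = "gaussian",  # squared-error loss for regression
  n.trees = 5000,             # B: the number of trees
  shrinkage = 0.01,           # lambda: the rate at which boosting learns
  interaction.depth = 1       # d: the complexity of each tree
)
```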

Random Forests

  • The third method, random forests, is an improvement on the bagging approach that further raises accuracy
  • How? It decorrelates the trees produced by the bagging method
  • In addition to creating bootstrapped samples, it also randomly selects a subset of the features as candidates at each split of each tree
  • That is, each time a split is considered, a random sample of \(m\) predictors is chosen from the total of \(p\) predictors
  • A new sample of \(m\) predictors is taken at each split
  • The number of predictors considered, i.e. the value of \(m\), is typically selected as \(m \approx \sqrt{p}\); the sketch after this list shows how it is set in code
  • If this is so, can you find what could be one possible value of \(m\) in a dataset that has 33 predictors?
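
With the randomForest package this choice is the `mtry` argument. A minimal sketch (for reference, randomForest's own defaults are \(\sqrt{p}\) for classification and \(p/3\) for regression, so both are possible values of \(m\)):

```r
library(randomForest)

set.seed(1)
p <- ncol(mtcars) - 1    # number of predictors (10 in this toy data)

rf_fit <- randomForest(
  mpg ~ ., data = mtcars,
  mtry = floor(sqrt(p)), # m ~ sqrt(p) predictors tried at each split
  ntree = 500
)
```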

Random Forests

  • Hence, while building a random forest, not all predictors are considered at every split
  • Suppose a dataset has one extremely strong predictor and several other moderately strong predictors
  • Under the bagging approach, this strong predictor would be the top split in the majority of the trees
  • This makes the trees look very similar to one another, so their predictions are highly correlated
  • Averaging many highly correlated predictions does not reduce the variance by much

Random Forests

  • In such a case, random forests solve the problem by considering only a subset of the predictors at each split
  • This decorrelates the trees
  • The average of the resulting predictions therefore has less variance and is more reliable
  • Since this is the only difference between the bagging and random forest approaches, at what point would a random forest be identical to a bagged model?
  • At \(m = p\), as the sketch below confirms
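
You can verify this equivalence in code: setting `mtry = p` makes randomForest consider every predictor at every split, which is exactly bagging (this is also how the ISLR lab fits a bagged model):

```r
library(randomForest)

set.seed(1)
p <- ncol(mtcars) - 1

# With mtry = p, all predictors compete at every split,
# so the "random forest" below is simply a bagged model
bag_fit <- randomForest(mpg ~ ., data = mtcars, mtry = p, ntree = 500)
```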

Thank You :)